Choosing Bucket Boundaries for Histograms

Authors

  • H. V. Jagadish
  • Nick Koudas
  • Kenneth C. Sevcik
Abstract

Histograms have long been used to capture attribute value distribution statistics for query optimizers. More recently, there has been a growing interest in the use of histograms to produce quick approximate answers to decision support queries. This motivates finding good strategies for specifying histogram buckets. Under the assumption that finding optimal bucket boundaries is computationally inefficient, previous research has focused on finding heuristics that produce good solutions. In this paper, we present an algorithm to determine bucket boundaries optimally, in time proportional to the square of the number of distinct data values, for a broad class of optimality metrics. Through experimentation, we show that optimal histograms can have substantially lower reconstruction error than histograms produced according to popular heuristics. We also present a new heuristic, based on our understanding of the optimal solution, which in many cases obtains lower reconstruction error than previously proposed heuristics, with a computation cost that is still quite low.
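The abstract does not spell out the algorithm, so the following is only an illustrative sketch of how optimal bucket boundaries can be found by dynamic programming, not the authors' method. It assumes one particular error metric (sum of squared deviations of each frequency from its bucket mean) and a fixed bucket count; the function name `optimal_buckets` and its parameters are hypothetical. For a fixed number of buckets, the running time grows with the square of the number of values.

```python
from itertools import accumulate


def optimal_buckets(freqs, num_buckets):
    """Split freqs into num_buckets contiguous buckets minimizing total
    squared error around each bucket's mean (an assumed metric).

    Returns (total_error, buckets), where buckets is a list of
    half-open index ranges (start, end).
    """
    n = len(freqs)
    if not 1 <= num_buckets <= n:
        raise ValueError("num_buckets must be between 1 and len(freqs)")

    # Prefix sums of values and squared values give O(1) bucket error.
    ps = [0.0] + list(accumulate(freqs))
    ps2 = [0.0] + list(accumulate(f * f for f in freqs))

    def sse(i, j):
        # Squared error of one bucket covering freqs[i:j], with j > i.
        s = ps[j] - ps[i]
        return (ps2[j] - ps2[i]) - s * s / (j - i)

    inf = float("inf")
    # cost[k][j]: least error covering the first j values with k buckets.
    cost = [[inf] * (n + 1) for _ in range(num_buckets + 1)]
    # split[k][j]: start index of the last bucket in that best solution.
    split = [[0] * (n + 1) for _ in range(num_buckets + 1)]
    cost[0][0] = 0.0

    for k in range(1, num_buckets + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = cost[k - 1][i] + sse(i, j)
                if c < cost[k][j]:
                    cost[k][j] = c
                    split[k][j] = i

    # Walk the split table backwards to recover the bucket ranges.
    buckets = []
    j = n
    for k in range(num_buckets, 0, -1):
        i = split[k][j]
        buckets.append((i, j))
        j = i
    buckets.reverse()
    return cost[num_buckets][n], buckets
```

Calling `optimal_buckets([1, 1, 1, 10, 10, 10], 2)` places the single boundary between the third and fourth values, since each resulting bucket is then perfectly uniform. Other error metrics could be substituted by changing only the `sse` helper, provided the per-bucket error can still be computed independently for each bucket.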


Similar resources

Optimal Histograms with Quality Guarantees

Histograms are commonly used to capture attribute value distribution statistics for query optimizers. More recently, histograms have also been considered as a way to produce quick approximate answers to decision support queries. This widespread interest in histograms motivates the problem of computing histograms that are good under a given error metric. In particular, we are interested in an e...


Piecewise Linear Histograms for Selectivity Estimation

Selectivity estimation of queries is of critical importance to query optimization. In order to get accurate estimations, database management systems must maintain statistics to capture the underlying data distribution. Histograms are extensively used in commercial database systems for this purpose. Most current histogram techniques make the assumption that all values in a single bucket appear w...


Histogram refinement for content-based image retrieval

Color histograms are widely used for content-based image retrieval. Their advantages are efficiency, and insensitivity to small changes in camera viewpoint. However, a histogram is a coarse characterization of an image, and so images with very different appearances can have similar histograms. We describe a technique for comparing images called histogram refinement, which imposes additional con...


Histogram Refinement for Content-Based Image Retrieval, Greg Pass

Color histograms are widely used for content-based image retrieval. Their advantages are efficiency, and insensitivity to small changes in camera viewpoint. However, a histogram is a coarse characterization of an image, and so images with very different appearances can have similar histograms. We describe a technique for comparing images called histogram refinement, which imposes additional constra...


A nearly optimal and deterministic summary structure for update data streams

We present a deterministic summary structure over update streams that enables deterministic and the first space-optimal algorithms for a variety of problems, including estimating frequencies, finding approximate frequent items, finding approximate quantiles, finding hierarchical heavy hitters, approximately optimal B-bucket histograms, estimating inner product sizes, etc.


Publication date: 2007